# Clear the environment
rm(list = ls())
# Load in libraries
pacman::p_load(tidyverse, dplyr, stringr, here, readxl, pals, broom, knitr,
janitor, dagitty, ggdag, ggplot2, kableExtra, tools)Github Repository: https://github.com/jwonyk/crop-water-gamma
Introduction
About Data
California produces a large portion of the nation’s fruits, nuts, and vegetables. Irrigated agriculture dominates water use in the state, and the amount of water applied varies by crop type, hydrologic region, and year.1 The goal of this analysis is to understand how applied water per acre varies across crop types, hydrologic regions, and years using statewide data from 2016 - 2020.
This dataset provides annual estimates of:
- Applied Water per Acre
- Hydrologic Regions
- 20 Crop Categories
- Years 2016 - 2020
Our Research question is:
How do crop type, hydrologic region, and year influence the amount of applied irrigation water per acre in California?
We will explore and examine the data and find what statistical model is appropriate.
Data Access
California Department of Water Sources - Statewide Agricultural Land & Water Use (2016 -2020)
Clear the Environment and Load Libraries
Load and Explore Data
In this section, we load the raw Excel file and prepare it for analysis. Real datasets often contain unnecessary columns, inconsistent variable names, or values that must be converted into usable formats. We use functions like clean_names(), select(), and pivot_longer() to transform the data into a tidy structure.
# Read in data and clean the name to snake case and drop unused columns
water_raw <- read_excel(here("posts", "crop-water-gamma", "data",
"agricultural_water_use_data_2016_2020.xlsx"),
sheet = "Statewide_AW_Unit", skip = 1) %>%
clean_names() %>%
select(!starts_with("x"))
# Pivot all crop columns, convert variable types, and drop missing rows
water_long <- water_raw %>%
pivot_longer(cols = -c(year, hr),
names_to = "crop",
values_to = "aw_per_acre") %>%
mutate(aw_per_acre = as.numeric(aw_per_acre),
year = as.integer(year),
hr = as.factor(hr),
crop = as.factor(crop)) %>%
filter(!is.na(aw_per_acre),
aw_per_acre > 0)Data Cleaning Method and Explanation
The raw Excel file had many generated placeholders with columns labeled X1, X2, etc. which contained no data. We removed these using select(!start_with("x")). The dataset is structured in a wide format where each crop is stored in its own column. In order for us to fit regression model, we need to put the crops into one column which was achieved using pivot_longer() which converts into a tidy dataset that represents year-region-crop combination that corresponding with applied water per acre.
We also used clean_names() from janitor package that standardized all column names in snake case throughout the analysis for better readability.
Some observations showed zero applied water, which occurred for several reasons, such as the field not needing irrigation due to adequate rainfall, the land being empty and requiring no irrigation, or other reasons.
Because our research question asks how crop type, hydrologic region, and year influence the amount of irrigation water applied per acre, our focus is on variation in irrigation intensity among fields that were actually irrigated, not on the separate decision to withhold irrigation. For this reason, we restrict our analysis to positive applied-water observations and filter out zeroes.
Exploratory Visualization
Before fitting any statistical model, it is essential to understand the distribution of the variables. The violin plot visualizes how applied water per acre varies across crop types and reveals skewness, outliers, and general patterns. This helps justify the modeling choices and provides intuition about which predictors might influence water use.
Code
# Create a consistent color palette for crop types
crop_cols <- setNames(alphabet()[1:length(levels(water_long$crop))],
levels(water_long$crop))
water_long %>%
ggplot(aes(x = crop, y = aw_per_acre, fill = crop)) +
geom_violin(trim = FALSE, alpha = 0.7) +
geom_boxplot(width = 0.1, outlier.shape = NA, alpha = 0.5) +
scale_fill_manual(values = crop_cols) +
coord_flip() +
theme_bw() +
theme(legend.position = "none",
plot.title = element_text(size = 14, face = "bold", hjust = 0),
axis.title.y = element_blank()) +
labs(y = "Applied Water per Acre")The violin plot shows that applied water per acre is strongly right-skewed for most crops. The widest portions of the violins occur at lower values, indicating that most observations involve relatively low water use. At the same time, long right tails reflect a small number of farm-years with much higher water application. Most observations fall between 1 and 3 acre-feet per acre, with a few extreme values extending the distribution to the right.
Here, we are noticing some violin plots extending below zero on the x-axis. This phenomenon is called Kernel Density Estimation where the violin plots use to smooth the distribution of the data and provide a visual estimation of the entire population’s probable values, not just the observed data points.2
DAG (Directed Acyclic Graph)
A Directed Acyclic Graph clarifies our assumptions about how variables influence each other. Let’s dive into our variables:
Crop Types influence water use because different crops require different irrigation levels
The hydrologic region affects crop type because different regions have different crops
Both year and hydrologic influence applied water per acre due to climate and resource availability
These relationships reflect scientific justification, including crop, hydrological region, and predictors in the regression model.
Code
# Draw a DAG
water_dag <- dagitty("dag{
hr -> crop
hr -> aw_per_acre
year -> crop
year -> aw_per_acre
crop -> aw_per_acre}")
# Assign coordinate for each variable nodes
coordinates(water_dag) <- list(
x = c(hr = 0, year = 2, crop = 1, aw_per_acre = 3),
y = c(hr = 2, year = 2, crop = 1, aw_per_acre = 1))
# Assign color for the nodes
node_colors <- c(hr = "#7F95D1", year = "#FF82A9",
crop = "#FA824C", aw_per_acre = "#1B998B")
# Plot DAG
ggdag(water_dag, text = FALSE, node = FALSE) +
geom_dag_node(aes(fill = name), size = 12, shape = 21, color = "white") +
geom_dag_edges() +
scale_fill_manual(values = node_colors, name = "Variables",
labels = c(aw_per_acre = "Applied Water per Acre",
crop = "Crop Type",
hr = "Hydrologic Region",
year = "Year")) +
theme_minimal() +
theme(legend.title = element_text(size = 15, face = "bold", hjust = 0.5),
legend.position = "right",
legend.text = element_text(size = 12, face = "bold"),
axis.title = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank(),
panel.background = element_blank(),
plot.margin = margin(10, 10, 10, 10))The DAG summarizes our assumptions about how crop type, hydrologic region, and year influence applied water per acre. Hydrologic region influences crop type because certain crops grow better in specific climates and soils. Both hydrologic region and year influences the response variable due to variation in precipitation, groundwater availability, and climate condition. Year affects crop distribution annually as they response directly to the market and climate.
Statistical Modeling
Why Gamma?
The Gamma Distribution with a log link would be appropriate for our dataset because applied water per acre is positive, continuous, and right-skewed. Rather than adding or subtracting a fixed amount of water, the model describes how water use scales up or down from the baseline. Each coefficient shows how much lower or higher expected water use is relative to the baseline condition. In other words, instead of saying “this crop uses n amount more acre-foot of water”, the model would tell us that the crop or region would use n% more / less water than average.
An interaction between crop type and hydrologic region could further expand our research scope to include region-specific crop water requirements. We will revisit a different model for this next time.
Gamma Model in Statistical Notation
The mathematical form of the model explains this notation:
\[ \begin{align*} &\text{Applied Water} \sim Gamma(\mu, \phi) \\ \log(\mu) = \beta_0 + &\beta_1 (\text{Crop}) + \beta_2 (\text{Hydrologic Region}) + \beta_3 (\text{Year}) \end{align*} \]
The linear predictor for the mean is on the log scale, and the parameter of Gamma consists of the following:
\(\mu\) is expected applied water per acre
\(\phi\) is shape parameter and it controls with its size.
Larger \(\phi\) - distribution becomes more symmetric
Smaller \(\phi\) - distribution becomes more right-skewed and more spread out
The mathematical form of the log link for Gamma model:
\[ \text {Expected water use} = exp(\text {linear predictor or }\beta) \]
Fit Model to Simulated Data
To demonstrate how the Gamma model works under the known parameters, we simulated 1000 observations from a Gamma distribution with a log-linear mean structure. We set the \(\beta_0 = 1\) and a slope \(\beta_1 = -0.4\).
# Initialize simulated random data
set.seed(123)
n <- 1000 # Sample Size
beta0 <- 1 # Intercept
beta1 <- -0.4 # Slope, controls strength and direction
phi <- 4 # Gamma Shape
# Simulated a positive and right-skewed predictor
predictor <- rgamma(n, shape = 2, scale = 1)
# Simulated Gamma-distribute response with mean
response <- rgamma(n, shape = phi,
scale = exp(beta0 + beta1 * predictor) / phi)
sim_data <- tibble(predictor, response)
sim_model <- glm(response ~ predictor,
data = sim_data,
family = Gamma(link = "log"))
sim_results <- tidy(sim_model, conf.int = TRUE)
sim_results %>%
kbl(digits = 3,
caption = "Gamma Regression Coefficients from Simulated Data (with 95% CI)",
col.names = c("Term", "Estimate", "Std. Error",
"Statistic", "p-value",
"CI Lower", "CI Upper")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, position = "center") %>%
row_spec(0, bold = TRUE, background = "#f0f0f0")| Term | Estimate | Std. Error | Statistic | p-value | CI Lower | CI Upper |
|---|---|---|---|---|---|---|
| (Intercept) | 1.017 | 0.027 | 37.187 | 0 | 0.963 | 1.071 |
| predictor | -0.415 | 0.012 | -34.652 | 0 | -0.439 | -0.392 |
Increasing the sample size to 1000 observations leads to tighter confidence intervals and demonstrates that the Gamma model can reliably recover the assumed parameters.
Parameter Recovery for Simulated Data
During the simulation, the true parameters were:
- Intercepts: \(\beta_0 = 1\)
- Slope: \(\beta_1 = -0.4\)
The fitted Gamma regression model estimated coefficients very close to the true values we selected. The intercept estimate is approximately 1, and the slope estimate is approximately -0.4, indicating that the Gamma model with a log link successfully recovers the underlying parameters when the model assumptions hold. The simulation validates that the Gamma model applies to the water data.
Code
# Plot the simulated data with estimates
ggplot(sim_data, aes(x = predictor, y = response)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "glm",
method.args = list(family = Gamma(link = "log")),
se = TRUE, color = "red", fill = "blue", size = 1) +
theme_bw() +
labs(x = "Predictor", y = "Response")Fit to Real Life Data
After confirming that the Gamma model performs well on simulated data, we fit the model. This step allows us to quantify how applied water per acre varies across crop types, hydrologic regions, and years in real-world conditions.
Code
# Fit Gamma Regression using log link to model applied water per acre
water_gamma_model <- glm(aw_per_acre ~ crop + hr + factor(year),
family = Gamma(link = "log"),
data = water_long)
# Make the table presentable
water_results <- tidy(water_gamma_model, conf.int = TRUE) %>%
mutate(term = gsub("^crop", "Crop: ", term),
term = gsub("^hr", "Region: ", term),
term = gsub("factor\\(year\\)", "Year: ", term),
term = gsub("Region: \\d+_", "Region: ", term),
term = gsub("other_field_Crop: s", "Other Field Crops", term),
term = gsub("truck_Crop: s", "Truck Crops", term),
term = gsub("_", " ", term),
term = toTitleCase(term)) %>%
kable(format = "html",
caption = "Gamma Regression Coefficients with 95% Confidence Intervals",
col.names = c("Term", "Estimate", "Std. Error", "Statistic", "p-value",
"CI (Lower)", "CI (Upper)"), digits = 3,
align = "c") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE, position = "center") %>%
row_spec(0, bold = TRUE, background = "#f0f0f0")
# Show table
water_results| Term | Estimate | Std. Error | Statistic | p-value | CI (Lower) | CI (Upper) |
|---|---|---|---|---|---|---|
| (Intercept) | 1.281 | 0.041 | 31.156 | 0.000 | 1.200 | 1.363 |
| Crop: Almonds Pistachios | -0.091 | 0.043 | -2.099 | 0.036 | -0.175 | -0.006 |
| Crop: Avg Aw | -0.338 | 0.041 | -8.283 | 0.000 | -0.419 | -0.258 |
| Crop: Citrus Subtropical | -0.187 | 0.042 | -4.454 | 0.000 | -0.269 | -0.105 |
| Crop: Corn | -0.419 | 0.041 | -10.097 | 0.000 | -0.500 | -0.338 |
| Crop: Cotton | -0.132 | 0.052 | -2.522 | 0.012 | -0.234 | -0.029 |
| Crop: Cucurbits | -0.685 | 0.042 | -16.265 | 0.000 | -0.768 | -0.603 |
| Crop: Dry Beans | -0.675 | 0.045 | -15.083 | 0.000 | -0.763 | -0.587 |
| Crop: Grain | -1.391 | 0.041 | -34.037 | 0.000 | -1.471 | -1.311 |
| Crop: Onions Garlic | -0.669 | 0.042 | -15.756 | 0.000 | -0.752 | -0.586 |
| Crop: Other Deciduous | -0.233 | 0.041 | -5.656 | 0.000 | -0.314 | -0.152 |
| Crop: Other Field Crops | -0.237 | 0.041 | -5.717 | 0.000 | -0.318 | -0.156 |
| Crop: Pasture | 0.101 | 0.041 | 2.476 | 0.013 | 0.021 | 0.181 |
| Crop: Potatoes | -0.149 | 0.044 | -3.390 | 0.001 | -0.236 | -0.063 |
| Crop: Rice | 0.087 | 0.057 | 1.524 | 0.128 | -0.024 | 0.199 |
| Crop: Safflower | -0.581 | 0.046 | -12.555 | 0.000 | -0.671 | -0.490 |
| Crop: Sugar Beets | -0.247 | 0.156 | -1.589 | 0.112 | -0.539 | 0.072 |
| Crop: Tomato Fresh | -0.626 | 0.112 | -5.573 | 0.000 | -0.840 | -0.399 |
| Crop: Tomato Processing | -0.454 | 0.045 | -10.140 | 0.000 | -0.541 | -0.366 |
| Crop: Truck Crops | -0.770 | 0.041 | -18.838 | 0.000 | -0.850 | -0.689 |
| Crop: Vineyard | -0.293 | 0.042 | -6.983 | 0.000 | -0.375 | -0.211 |
| Region: San Francisco Bay | 0.005 | 0.036 | 0.132 | 0.895 | -0.067 | 0.076 |
| Region: Central Coast | 0.038 | 0.035 | 1.065 | 0.287 | -0.032 | 0.107 |
| Region: South Coast | 0.237 | 0.036 | 6.619 | 0.000 | 0.167 | 0.308 |
| Region: Sacramento River | 0.181 | 0.035 | 5.190 | 0.000 | 0.113 | 0.250 |
| Region: San Joaquin River | 0.201 | 0.035 | 5.767 | 0.000 | 0.133 | 0.270 |
| Region: Tulare Lake | 0.200 | 0.035 | 5.667 | 0.000 | 0.130 | 0.269 |
| Region: North Lahontan | 0.121 | 0.043 | 2.809 | 0.005 | 0.037 | 0.206 |
| Region: South Lahontan | 0.563 | 0.037 | 15.152 | 0.000 | 0.490 | 0.636 |
| Region: Colorado River | 0.581 | 0.035 | 16.458 | 0.000 | 0.511 | 0.650 |
| Region: State | 0.242 | 0.035 | 6.955 | 0.000 | 0.174 | 0.310 |
| Year: 2017 | -0.109 | 0.023 | -4.766 | 0.000 | -0.153 | -0.064 |
| Year: 2018 | -0.043 | 0.023 | -1.881 | 0.060 | -0.089 | 0.002 |
| Year: 2019 | -0.137 | 0.023 | -5.963 | 0.000 | -0.182 | -0.092 |
| Year: 2020 | -0.062 | 0.023 | -2.678 | 0.008 | -0.107 | -0.016 |
Regions differ substantially in expected water use even after accounting for crop type and year. The Colorado River (0.58) and South Lahontan (0.56) regions show the largest positive coefficients. These regions are estimated to use approximately 75-80% more water per acre than the baseline region.
We know the percentage by calculating log link:
\(exp(\beta \text{ of Colorado River}) = exp(0.58) \approx 1.79\)
\(exp(\beta \text{ of South Laho.ntan}) = exp(0.56) \approx 1.75\)
These regional differences are consistent with climate patterns across California. Inland regions such as the Colorado River and South Lahontan are known to have hotter, drier conditions and higher evaporative demand (the atmosphere’s tendency to remove moisture from soil and plants), requiring higher irrigation. On the other hand, coastal regions such as the San Francisco Bay and Central Coast experience cooler temperatures, corresponding to lower irrigation needs.
Coefficient Plot
We visualize the estimated regression coefficients along with their 95% confidence intervals. This coefficient plot provides the uncertainty of each predictor’s association with applied water per acre.
Code
tidy(water_gamma_model, conf.int = TRUE) %>%
filter(grepl("^hr", term)) %>%
mutate(term = gsub("^hr\\d+_", "", term),
term = gsub("_", " ", term),
term = reorder(term, estimate)) %>%
ggplot(aes(x = term, y = estimate)) +
geom_point(size = 2) +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.15) +
coord_flip() +
theme_bw() +
labs(x = "Hydrologic Region", y = "Estimate (log scale)")Inference
We are interested in if hydrologic region is associated with differences in applied water per acre, after adding all consideration for crop types and year.
Explain Hypothesis in Plain-Language
Some hydrologic regions require more applied water per acre than others, even after adjusting for crop type and year.
\[ \begin{aligned} H_0:&\ \beta_r = 0 \quad \text{(no difference in expected water use)} \\ H_A:&\ \beta_r \neq 0 \quad \text{(a region differs from the baseline region)} \end{aligned} \]
Visualizing Hypothesis
Before interpreting hypothesis test results, use visualization to examine applied water per acre by hydrologic region. The figure will help to evaluate how water use differs across regions in the observed data and results of the Gamma regression model.
Code
# Plot
ggplot(water_long, aes(x = hr, y = aw_per_acre, fill = hr)) +
geom_violin(trim = FALSE, alpha = 0.5) +
geom_boxplot(width = 0.15, outlier.shape = NA) +
scale_fill_brewer(palette = "Set3") +
scale_x_discrete(labels = function(x)
{x <- gsub("^\\d+_", "", x)
x <- gsub("_", " ", x)}) +
labs(x = "Hydrologic Region",
y = "Applied Water per Acre") +
theme_minimal() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1))The combined violin and box plot displays both the full distribution of water use within each region and key summary statistics, allowing us to evaluate differences in central tendency, variability, and skewness across regions.
Conclusion
This analysis shows that the hydrologic region is a key determinant of agricultural irrigation intensity in California. Even after accounting for crop type and year, inland regions consistently require more applied water per acre than coastal regions, as the hotter, drier air in inland climates causes water to evaporate more quickly from soils and crops than crop composition alone. Supported by the hypothesis visualization, which shows shifts in the distribution, reinforcing the model-based evidence of regional differences. By focusing on irrigated fields, this analysis isolates variation in irrigation intensity but does not model irrigation decisions or region-specific crop responses. Future work could incorporate crop–region interactions to understand better how water requirements vary across California’s diverse hydrologic landscapes.
Reference
California Department of Water Resources. (2020). Statewide Agricultural Water Use Data, 2016–2020 [data file]. Available: https://data.ca.gov/dataset/statewide-agricultural-water-use-data-2016-2020. [Accessed: Oct. 25, 2025]
Wikimedia Commons. (2024). Sisteme-de-irigatie.jpg [image file]. Available: https://commons.wikimedia.org/w/index.php?title=File:Sisteme-de-irigatie.jpg&oldid=838313884. [Accessed: Dec. 4, 2025]
Footnotes
California Department of Food and Agriculture. (2024). California agricultural production statistics [web page]. Available: https://www.cdfa.ca.gov/statistics.↩︎
Mode Analytics. (2021). Violin Plots 101: Visualizing Distribution and Probability Density [web page]. Available: https://mode.com/blog/violin-plot-examples.↩︎
Citation
@online{kim2025,
author = {Kim, Jay},
title = {How {Crop} {Type} and {Climate} {Zone} {Influence}
{Agricultural} {Water} {Use} in {California}},
date = {2025-12-11},
url = {https://jwonyk.github.io/posts/crop-water-gamma/},
langid = {en}
}